---
title: "Assignment 1: Estimating CDF and PDF"
author: "Samuel Johnson"
header-includes:
  - \usepackage{amssymb}
date: " Due: 2/3/2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and date headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and date headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4  */
    font-size: 14px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunks in the output file
                      warning = FALSE,   # do not show warning messages produced by the code
                      results = TRUE,    # include text output in the output file
                      message = FALSE,   # do not show package startup and other messages
                      comment = NA       # no prefix on printed output lines
                      )
```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

We are given the following definition for $\hat{F}_n(t)$, where $t_1, \dots, t_n$ are the observed failure times:

$$ \hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I} \left(t_i \leq t \right)$$
where,

$$ \mathbb{I}(t_i \leq t) = 
  \begin{cases}
      0 & t_i > t\\
      1 & t_i \leq t
  \end{cases} 
$$
```{r}
random_sample <- c(23, 45, 67, 89, 112, 156, 189, 245)

# Build the ECDF of sample x as a closure: the returned function
# gives the proportion of observations <= t for each element of t.
sams_ecdf <- function(x) {
  n <- length(x)
  function (t) {
    # mat[i, j] is TRUE when x[i] <= t[j]
    mat <- outer(x, t, FUN="<=")
    return (colSums(mat) / n)
  }
}

# Check the implementation against stats::ecdf on a grid that
# includes every observed failure time (the jump points).
r_lang_ecdf <- stats::ecdf(random_sample)
sam_ecdf <- sams_ecdf(random_sample)

test_points <- seq(20, 250, 1)
results <- r_lang_ecdf(test_points) == sam_ecdf(test_points)
print(paste("fails: ", sum(!results)))
print(paste("passes: ", sum(results)))
```
```{r}
x_points <- seq(20, 250, 0.5)
plot <- ggplot(data=data.frame(x=x_points, y_1=sam_ecdf(x_points), y_2=r_lang_ecdf(x_points))) +
  # geom_step draws each curve as a step function rather than joining
  # neighbouring grid points with sloped segments
  geom_step(aes(x=x, y=y_1, color="Sam's ECDF")) + 
  geom_step(aes(x=x, y=y_2, color="R's ECDF"), linetype="dashed") +
  labs(x="failure time (hours)", y="cumulative probability") + 
  labs(title="Comparison of ECDF Implementations")

ggplotly(plot)
```
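
The role of the inequality in the indicator can be checked directly at a jump point. The following is a minimal sketch; `failure_times`, `ecdf_leq`, and `ecdf_lt` are illustrative names that are not used elsewhere in this assignment.

```{r}
failure_times <- c(23, 45, 67, 89, 112, 156, 189, 245)

# proportion of observations <= t (the ECDF definition)
ecdf_leq <- function(t) mean(failure_times <= t)
# proportion of observations < t (strict inequality, shown for contrast)
ecdf_lt <- function(t) mean(failure_times < t)

# At the observed failure time t = 89 the two versions differ by 1/n,
# and only the "<=" version agrees with stats::ecdf.
c(leq = ecdf_leq(89), lt = ecdf_lt(89), r = stats::ecdf(failure_times)(89))
```

Away from the observed failure times the two versions coincide, which is why a test grid should include the jump points themselves.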


b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

```{r}
random_sample <- c(23, 45, 67, 89, 112, 156, 189, 245)
r_ecdf <- stats::ecdf(random_sample)
print(r_ecdf(100))
```
Here we compute $\hat{F}_n(100)$, the proportion of observed failure times that are less than or equal to 100 hours. Four of the eight observations (23, 45, 67, and 89) satisfy this, so $\hat{F}_n(100) = 4/8 = 0.5$. The ECDF estimates the true CDF, that is, the probability that a failure occurs at or before 100 hours; because no component failed at exactly 100 hours, the estimate for failure strictly before 100 hours is also 0.5. I therefore agree with the colleague that 0.5 is a reasonable estimate from these data, with the caveat that it is a point estimate based on only 8 observations and so carries considerable sampling uncertainty.
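
To give a rough sense of how uncertain an estimate from 8 observations is, the sketch below computes two standard quantities: the half-width of a 95% Dvoretzky-Kiefer-Wolfowitz (DKW) confidence band for the ECDF, and an exact binomial confidence interval for the proportion 4/8. This is an illustrative addition rather than part of the assigned method, and the names `n_obs`, `alpha`, `dkw_eps`, and `binom_ci` are chosen here for the example.

```{r}
n_obs <- 8
alpha <- 0.05

# DKW inequality: with probability at least 1 - alpha the true CDF lies
# within dkw_eps of the ECDF at every t simultaneously
dkw_eps <- sqrt(log(2 / alpha) / (2 * n_obs))

# exact (Clopper-Pearson) interval for the proportion of failures by 100 hours
binom_ci <- binom.test(x = 4, n = n_obs, conf.level = 1 - alpha)$conf.int

dkw_eps
binom_ci
```

Both intervals are wide at this sample size, so 0.5 is best read as the most plausible single value rather than a precise probability.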


## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.

```{r}
random_sample <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
d_frame <- data.frame(x=random_sample)
plot <- ggplot(data=d_frame, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),bins=3)
ggplotly(plot)
```
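Here, the histogram has a rough, symmetric mound shape centered at 18.85.

The estimated density of each bin, computed as count / ($n \times$ bin width) with $n = 10$ and a bin width of 6.55:

* bin 1: 3 / 65.5 ≈ 0.0458
* bin 2: 4 / 65.5 ≈ 0.0611
* bin 3: 3 / 65.5 ≈ 0.0458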
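
The bin densities can be checked with base R's `hist()`. The sketch below assumes that `geom_histogram(bins = 3)` uses a bin width of range/2 and centers the first and last bins on the minimum and maximum of the data; that assumption is consistent with the center of 18.85 and the densities reported here, but it is not taken from the plot object itself. The names `hist_times`, `bin_width`, `bin_breaks`, and `h_fit` are illustrative.

```{r}
hist_times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# assumed ggplot2 placement: width = range / (bins - 1), outer bins
# centered on min(x) and max(x)
bin_width <- diff(range(hist_times)) / 2
bin_breaks <- seq(min(hist_times) - bin_width / 2, by = bin_width, length.out = 4)

h_fit <- hist(hist_times, breaks = bin_breaks, plot = FALSE)

# density = count / (n * bin width)
data.frame(lower = head(bin_breaks, -1),
           upper = tail(bin_breaks, -1),
           count = h_fit$counts,
           density = round(h_fit$density, 4))
```

A different placement of three equal-width bins, for example spanning exactly from the minimum to the maximum, would give narrower bins and therefore larger density values for the same counts.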


b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$
```{r}


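# Kernel density estimator as a closure over the observed data, the
# kernel K (standard normal density by default), and the bandwidth h.
kernel_density <- function(obs, K=dnorm, h=2) {
  n <- length(obs)
  function(t) {
    # k[j, i] = K((t[j] - obs[i]) / h)
    k <- K( outer(t, obs, FUN="-") / h)
    s <- rowSums(k)
    return (s / n / h)
  }
}

x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

my_gaussian_kernel_density <- kernel_density(x)

# density() evaluates the estimate on its own grid; reuse that grid
r_density <- density(x, bw=2, kernel="gaussian")

plot <- ggplot(data=data.frame(x=r_density$x, y=r_density$y, y_2=my_gaussian_kernel_density(r_density$x))) + 
  geom_line(aes(x=x, y=y_2, color="sam")) +
  geom_line(aes(x=x, y=y, color="r"), linetype="dashed") +
  labs(x="failure time", y="density", color="implementation")
  
ggplotly(plot)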

```
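
The visual comparison can be complemented with a numerical one. The sketch below evaluates the Gaussian kernel formula directly on the grid returned by `density()` and reports the largest absolute difference, along with a trapezoidal estimate of the area under the curve. The names `obs_times`, `h_check`, `r_fit`, `manual_kde`, `max_abs_diff`, and `area` are introduced only for this check. A small nonzero difference is expected, because `density()` computes its values with a binned approximation rather than the exact sum.

```{r}
obs_times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
h_check <- 2

r_fit <- density(obs_times, bw = h_check, kernel = "gaussian")

# direct evaluation of (1 / (n h)) * sum K((t - t_i) / h) at each grid point
manual_kde <- sapply(r_fit$x, function(t) {
  mean(dnorm((t - obs_times) / h_check)) / h_check
})

# largest pointwise disagreement with density()
max_abs_diff <- max(abs(manual_kde - r_fit$y))

# trapezoidal rule over the (equally spaced) grid; should be close to 1
grid_step <- diff(r_fit$x)[1]
area <- sum((manual_kde[-1] + manual_kde[-length(manual_kde)]) / 2) * grid_step

c(max_abs_diff = max_abs_diff, area = area)
```
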
c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$

```{r}

x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# Epanechnikov kernel: (3/4)(1 - u^2) for |u| <= 1, and 0 elsewhere
K <- function(u) {
    mask <- abs(u) <= 1
    return (((3/4) * (1 - u^2)) * mask)
}

kernel_density <- function(sample, K=dnorm) {
  function(t, h=2) {
    n <- length(sample)
    # compute a matrix in which row j holds the
    # terms to sum for element t[j]
    k <- K(outer(t, sample, FUN="-") / h)
    # sum over the observations for each value of t
    s <- rowSums(k)
    # scale by 1 / (n h) and return
    return (s / h / n)
  }
}

my_epanechnikov_kernel_density <- kernel_density(x, K=K)

# reference curve: R's Gaussian estimate with the same bandwidth value
r_density <- density(x, kernel="gaussian", bw=2)
plot <- ggplot(data=data.frame(x=r_density$x, y=r_density$y, y_1=my_epanechnikov_kernel_density(r_density$x))) +
  geom_line(aes(x=x, y=y, color="gaussian")) + 
  geom_line(aes(x=x, y=y_1, color="Sam's epanechnikov"), linetype="dashed")
  
plot
```
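
One point to keep in mind when comparing a hand-written Epanechnikov estimator with `density(kernel = "epanechnikov")`: `density()` treats `bw` as the standard deviation of the kernel, so its Epanechnikov kernel has support half-width $\sqrt{5}\,bw$. In the notation of the formula above, that corresponds to $h = \sqrt{5}\,bw$. The sketch below illustrates this; `fail_times`, `epan`, `epan_kde`, and `bw_sd` are names chosen for the example.

```{r}
fail_times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# Epanechnikov kernel on |u| <= 1
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# direct evaluation of (1 / (n h)) * sum K((t - t_i) / h)
epan_kde <- function(t, h) {
  sapply(t, function(ti) mean(epan((ti - fail_times) / h)) / h)
}

bw_sd <- 2
r_epan <- density(fail_times, bw = bw_sd, kernel = "epanechnikov")

# using h = bw directly does not reproduce density() ...
diff_same_h <- max(abs(epan_kde(r_epan$x, h = bw_sd) - r_epan$y))
# ... but rescaling to h = sqrt(5) * bw does, up to density()'s approximation
diff_scaled_h <- max(abs(epan_kde(r_epan$x, h = sqrt(5) * bw_sd) - r_epan$y))

c(same_h = diff_same_h, scaled_h = diff_scaled_h)
```

For the Gaussian kernel no rescaling is needed, since the standard deviation of the standard normal kernel scaled by $h$ is $h$ itself.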



d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
plot <- ggplot(data=data.frame(x=x), aes(x=x)) +
  geom_density(bw=1.5, kernel="gaussian", aes(color="gaussian h=1.5")) +
  geom_density(bw=1.5, kernel="epanechnikov", aes(color="epanechnikov h=1.5")) +
  geom_density(bw=2.5, kernel="gaussian", aes(color="gaussian h=2.5")) +
  geom_density(bw=2.5, kernel="epanechnikov", aes(color="epanechnikov h=2.5")) +
  labs(title="Impact of Bandwidth and Kernel on Density Approximation")

ggplotly(plot)

```
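Lower values of $h$ give bumpier estimates that follow local clusters in the data more closely, while higher values of $h$ give smoother, flatter estimates: moving from $h=1.5$ to $h=2.5$ tends to lower the peak and spread more mass into the tails for both kernels. The choice of kernel has a smaller impact than the choice of $h$, but the Gaussian kernel generally appears smoother than the Epanechnikov kernel for the same bandwidth. Note that density(), which geom_density() uses, interprets `bw` as the standard deviation of the kernel, so the Epanechnikov curves plotted here have a support half-width of $\sqrt{5}\,bw$ rather than $bw$.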
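
The effect of the bandwidth can also be summarised numerically. The sketch below evaluates the Gaussian kernel estimate for $h=1.5$ and $h=2.5$ on a fixed grid and reports the peak height and a simple roughness measure (the sum of squared second differences). The roughness measure and the names `sample_times`, `t_grid`, `gauss_kde`, `roughness`, and `summary_h` are choices made for this illustration, not part of the assignment.

```{r}
sample_times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
t_grid <- seq(5, 33, by = 0.05)

# Gaussian kernel density estimate evaluated directly from the formula
gauss_kde <- function(t, h) {
  sapply(t, function(ti) mean(dnorm((ti - sample_times) / h)) / h)
}

# discrete roughness: larger values mean a wigglier curve
roughness <- function(f) sum(diff(f, differences = 2)^2)

# one column per bandwidth, rows "peak" and "roughness"
summary_h <- sapply(c(h_1.5 = 1.5, h_2.5 = 2.5), function(h) {
  f <- gauss_kde(t_grid, h)
  c(peak = max(f), roughness = roughness(f))
})

summary_h
```

Both the peak and the roughness are smaller for the larger bandwidth, which matches the visual impression from the plot.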



